Skip to content

Wait for MARKER_RECORDED to fire version callback on replay#2821

Merged
eamsden merged 18 commits into
masterfrom
eamsden/interleaved-UpdateCompleted-getversion
Jun 8, 2026
Merged

Wait for MARKER_RECORDED to fire version callback on replay#2821
eamsden merged 18 commits into
masterfrom
eamsden/interleaved-UpdateCompleted-getversion

Conversation

@eamsden

@eamsden eamsden commented Mar 30, 2026

Copy link
Copy Markdown
Contributor

What was changed

  • Changed VersionStateMachine replay behavior behind a new SDK flag VERSION_WAIT_FOR_MARKER so getVersion() no longer resumes workflow code when the fake RECORD_MARKER command is created during replay.
  • Replay now waits until the real MARKER_RECORDED event is matched before firing the version callback.

Why?

Replay was resuming getVersion() too early relative to update completion protocol handling. In the interleaved update history from issue #2796, that allowed replay to queue the second version marker before the update completion work had been matched, which produced a replay-only nondeterminism error:

[TMPRL1100] getVersion call before the existing version marker event

Delaying the replay callback until the actual marker event is matched makes version replay ordering consistent with replayed side effects and fixes replay for the recorded history.

Checklist

  1. Closes Replay fails: UpdateCompleted events between version markers breaks replay #2796 but does not fix replay of existing histories.

  2. How was this tested:

  • ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
  • ./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
  • Verified that the original repro history now replays successfully through both WorkflowReplayer and the lower-level direct-query replay task handler.
  1. Any docs updates needed?
  • No docs updates needed. This is an internal replay-ordering fix with regression tests.

@eamsden eamsden changed the title Eamsden/interleaved update completed getversion Wait for MARKER_RECORDED to fire version callback on replay Mar 30, 2026
@eamsden eamsden marked this pull request as ready for review March 30, 2026 17:51
@eamsden eamsden requested a review from a team as a code owner March 30, 2026 17:51
@eamsden eamsden force-pushed the eamsden/interleaved-UpdateCompleted-getversion branch from ff59a05 to 43ab9df Compare March 31, 2026 17:08
Comment thread temporal-sdk/src/main/java/io/temporal/internal/common/SdkFlag.java
@eamsden eamsden force-pushed the eamsden/interleaved-UpdateCompleted-getversion branch 2 times, most recently from 31e4195 to 316a40b Compare April 1, 2026 21:42
@eamsden eamsden marked this pull request as draft April 1, 2026 21:52
@eamsden

eamsden commented Apr 1, 2026

Copy link
Copy Markdown
Contributor Author

Converting to draft until at least the shut-off of SKIP_YIELD_ON_VERSION is reverted.

eamsden added 12 commits June 3, 2026 16:24
Add a Java replay test that mirrors the Kotlin GreetingWorkflow sample and replays the provided workflow history fixture.

Assert that replay fails with the embedded TMPRL1100 NonDeterministicException message so the reproducer stays pinned to the reported failure mode.
Run a constrained experiment for the interleaved update replay bug by changing VersionStateMachine replay timing only for histories with SKIP_YIELD_ON_VERSION set. In that path, getVersion still returns synchronously, but the replay callback is no longer fired at fake RECORD_MARKER command creation and is instead delayed until the real MARKER_RECORDED event is matched.

The goal of the experiment was to verify that flagged histories do not depend on the current early replay callback or its extra eventLoop scheduling. The legacy interleaved update repro history does not have SKIP_YIELD_ON_VERSION, so it continues to fail unchanged and serves as the control case.

Verified with:
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest"
Change VersionStateMachine replay semantics so getVersion no longer resumes workflow code when the fake RECORD_MARKER command is created. Replay now waits until the real MARKER_RECORDED event is matched before firing the version callback, which makes version-marker ordering consistent with replayed side effects.

This fixes the interleaved update replay bug reproduced by testGetVersionInterleavedUpdateReplayHistory.json. That history previously failed replay with [TMPRL1100] because the second getVersion callback ran ahead of update completion protocol handling. After this change, the same recorded history replays successfully through both WorkflowReplayer and the lower-level direct-query replay task handler.

The earlier flag-gated experiment showed that delaying the callback was already safe for histories with SKIP_YIELD_ON_VERSION. This commit removes that temporary gating and applies the same replay ordering to all histories.

Verified with:
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingRemoveTest" --tests "io.temporal.workflow.versionTests.GetVersionRemovedInReplayTest" --tests "io.temporal.workflow.versionTests.GetVersionWithoutCommandEventTest" --tests "io.temporal.workflow.versionTests.GetVersionAndTimerTest" --tests "io.temporal.workflow.versionTests.GetVersionMultipleCallsTest" --tests "io.temporal.workflow.versionTests.GetVersionInSignalTest" --tests "io.temporal.workflow.versionTests.GetVersionMultithreadingTest" --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
./gradlew --offline :temporal-sdk:test --tests "io.temporal.workflow.versionTests.GetVersionInterleavedUpdateReplayTest" --tests "io.temporal.internal.replay.GetVersionInterleavedUpdateReplayTaskHandlerTest"
have elected not to fix that specific history. Make sure there is a
run-then-replay regression test which confirms new histories won't be
broken that way.
@eamsden eamsden force-pushed the eamsden/interleaved-UpdateCompleted-getversion branch from 316a40b to 648d583 Compare June 3, 2026 21:31
@eamsden eamsden marked this pull request as ready for review June 3, 2026 21:33
@eamsden eamsden enabled auto-merge (squash) June 3, 2026 22:05
Comment thread .github/workflows/ci.yml Outdated
@@ -1,5 +1,6 @@
name: Continuous Integration
permissions:
checks: write

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was part of fixing CI flakes. I think in this case It was for

https://github.com/temporalio/sdk-java/pull/2821/changes/BASE..ff8f61ae3cd4cbc030ec2469d1f63eb65458b39d#diff-b803fcb7f17ed9235f1e5cb1fcd2f5d3b2838429d4368ae4c57ce4436577f03fL56

to reliably succeed, but I'm checking that recall.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out not to be necessary, sorry this slipped through.

Comment thread temporal-sdk/src/main/java/io/temporal/internal/common/SdkFlag.java Outdated
eamsden and others added 2 commits June 8, 2026 14:31
Co-authored-by: James Watkins-Harvey <mjameswh@users.noreply.github.com>
@eamsden eamsden merged commit 4d53976 into master Jun 8, 2026
34 checks passed
@eamsden eamsden deleted the eamsden/interleaved-UpdateCompleted-getversion branch June 8, 2026 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replay fails: UpdateCompleted events between version markers breaks replay

2 participants